2. Creating some initial visualisations
Fires per year
year_plot <- fires_small %>%
group_by(fire_year) %>%
summarise(num_fires =n())
`summarise()` ungrouping output (override with `.groups` argument)
year_plot %>%
ggplot +
aes(x = fire_year, y = num_fires) +
geom_point() +
ylim(0, 120000)

# geom_col(fill = "dark blue", col ="white") +
# geom_smooth(method = "lm", se = FALSE, colour = "red")
There is a lot of variation between years. Visually, a pattern seems to repeat roughly every 5 years, with 4 peaks visible within the reporting period. Having looked at the historic weather for that date range, these peaks seem to coincide with recorded heatwaves in 2000, 2006 and 2011.(1)
https://en.wikipedia.org/wiki/List_of_heat_waves
model <- lm(formula = num_fires ~ fire_year, data = year_plot)
summary(model)
Call:
lm(formula = num_fires ~ fire_year, data = year_plot)
Residuals:
Min 1Q Median 3Q Max
-16835 -8688 -2049 9226 34793
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -609543.8 756667.9 -0.806 0.429
fire_year 343.3 377.7 0.909 0.373
Residual standard error: 12810 on 22 degrees of freedom
Multiple R-squared: 0.03621, Adjusted R-squared: -0.007601
F-statistic: 0.8265 on 1 and 22 DF, p-value: 0.3731
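The summary above gives a slope of about 343 fires per year with a p-value of 0.373, so the fitted trend cannot be distinguished from zero at the usual 5% level. As a minimal sketch of how those two numbers can be pulled out of a fitted model programmatically, here is the same kind of fit on made-up yearly counts (the `toy` data frame and its values are assumptions for illustration, not the wildfire data):

```r
# Toy yearly counts: 24 years of made-up values, NOT the wildfire data
set.seed(1)
toy <- data.frame(year = 1992:2015)
toy$count <- 70000 + rnorm(nrow(toy), sd = 12000)

# Same model form as num_fires ~ fire_year
toy_model <- lm(count ~ year, data = toy)

# The coefficient table is a matrix with named rows and columns
coefs <- summary(toy_model)$coefficients
slope <- coefs["year", "Estimate"]
p_val <- coefs["year", "Pr(>|t|)"]

slope
p_val
```

With 24 yearly points the model has 22 residual degrees of freedom, which matches the summary printed above.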
tidy(model)
clean_names(glance(model))
autoplot(model)
`arrange_()` is deprecated as of dplyr 0.7.0.
Please use `arrange()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.

# Assign the result so the added pred and resid columns are kept
year_plot <- year_plot %>%
  add_predictions(model) %>%
  add_residuals(model)
Warning message:
In `[<-.data.frame`(`*tmp*`, is_list, value = list(`23` = "<S3: blob>")) :
replacement element 1 has 1 row to replace 0 rows
year_plot
year_plot %>%
ggplot(aes(x = fire_year)) +
geom_point(aes(y = num_fires)) +
geom_abline(
intercept = model$coefficients[1],
slope = model$coefficients[2],
col = "red"
) +
ylim(0, 120000)

bootstrap <- year_plot %>%
  specify(formula = num_fires ~ fire_year) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "slope")

# Compute the interval first so that it exists when the plot is drawn
slope_ci95 <- bootstrap %>%
  get_ci(level = 0.95, type = "percentile")
slope_ci95

bootstrap %>%
  visualise(bins = 30) +
  shade_ci(endpoints = slope_ci95)
clean_names(tidy(model, conf.int = TRUE, conf.level = 0.95))
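To make the percentile bootstrap less of a black box, here is a rough base R sketch of the same idea: resample the rows with replacement, refit the line, and take the 2.5% and 97.5% quantiles of the resampled slopes. The `toy` data and the `boot_slope()` helper are assumptions for illustration only, and this is my understanding of the technique rather than a description of how the infer package implements it.

```r
# Toy yearly counts with a mild upward trend: made-up values, NOT the wildfire data
set.seed(42)
toy <- data.frame(year = 1992:2015)
toy$count <- 70000 + 300 * (toy$year - 1992) + rnorm(nrow(toy), sd = 12000)

# Hypothetical helper: resample rows with replacement and return the refitted slope
boot_slope <- function(data) {
  rows <- sample(nrow(data), replace = TRUE)
  coef(lm(count ~ year, data = data[rows, ]))[["year"]]
}

# 2000 resampled slopes (fewer than the 10000 used above, to keep it quick)
slopes <- replicate(2000, boot_slope(toy))

# Percentile interval: the middle 95% of the resampled slopes
ci95 <- quantile(slopes, probs = c(0.025, 0.975), na.rm = TRUE)
ci95

# Theoretical interval from the model itself, for comparison
confint(lm(count ~ year, data = toy), "year", level = 0.95)
```

If the two intervals are broadly similar, that is some reassurance that the normal-theory interval from `tidy(model, conf.int = TRUE)` is reasonable for data like this.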
Fires per day
fires_small %>%
group_by(discovery_date) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = discovery_date, y = num_fires) +
geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

This shows a typical time series plot with a cyclic variation due to warmer weather in the summer time.
Fires per month
fires_small %>%
mutate(year_month = make_date(fire_year, discovery_moy)) %>%
group_by(year_month) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = year_month, y = num_fires) +
geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

Peaks still occur in the summer. The 2006 heatwave is especially visible.
Fires by day of year
fires_small %>%
group_by(discovery_doy) %>%
summarise(num_fires = n()) %>%
ggplot(aes(x = discovery_doy, y = num_fires)) +
geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

There are peaks around days 60-110 and a big peak around day 180.
Checking the data to see where the peak occurs
fires_small %>%
group_by(discovery_doy) %>%
summarise(num_fires = n()) %>%
arrange(desc(num_fires))
`summarise()` ungrouping output (override with `.groups` argument)
The 2 busiest days of the year are days 185 and 186, which are Independence Day (4th July) in a normal year and a leap year respectively. So I imagine most of the extra fires (more than double the normal amount) are caused by fireworks.
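A quick way to check the day-of-year claim is to ask R for the day number of 4th July in a normal year and in a leap year. This is just a small sketch; 2015 and 2016 are picked as example years.

```r
# 4th July in a normal year (2015) and in a leap year (2016)
july4 <- as.Date(c("2015-07-04", "2016-07-04"))

# "%j" formats a date as its day of the year (001-366)
doy <- as.integer(format(july4, "%j"))
doy
# → 185 186
```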
Fires by month of year
fires_small %>%
group_by(discovery_moy) %>%
summarise(num_fires = n()) %>%
ggplot(aes(x = discovery_moy, y = num_fires)) +
geom_col(fill = "dark blue", col = "white")
`summarise()` ungrouping output (override with `.groups` argument)

There are 2 definite peaks during the year. The March and April peak is probably due to the US “Spring Break”, when schools and universities are closed and families are likely to be on vacation, possibly visiting National Parks. July and August are the school summer break, so both families visiting parks and hot weather are likely causes of fire outbreaks.
Fires by cause
options(scipen = 999)
fires_small %>%
group_by(stat_cause_descr) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(reorder(x = stat_cause_descr, num_fires), y = num_fires) +
geom_col(fill = "dark blue") +
coord_flip()
`summarise()` ungrouping output (override with `.groups` argument)

Fire avg size by cause
fires_small %>%
group_by(stat_cause_descr) %>%
summarise(avg_size = mean(fire_size)) %>%
ggplot +
aes(reorder(x = stat_cause_descr, avg_size), y = avg_size) +
geom_col(fill = "dark blue") +
coord_flip()
`summarise()` ungrouping output (override with `.groups` argument)

Avg burn time by cause
fires_small %>%
summarise(num_na = sum(is.na(cont_date)))
Roughly half of the containment dates are missing, which makes any meaningful analysis of burn time very difficult.
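The chunk above only gives the count of missing values. To express it as a share of all rows, the mean of `is.na()` does the job directly. A tiny sketch on made-up values (the `toy` data frame is an assumption, not the fires data):

```r
# Toy containment dates: made-up Julian day values with two missing entries
toy <- data.frame(cont_date = c(2450000.5, NA, 2450010.5, NA))

# Count of missing values, as in the chunk above
num_na <- sum(is.na(toy$cont_date))

# Share of missing values: TRUE counts as 1 and FALSE as 0
missing_share <- mean(is.na(toy$cont_date))

num_na
missing_share
```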
Fires by size
fires_small %>%
group_by(fire_size_class) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = fire_size_class, y = num_fires, fill = fire_size_class) +
geom_col() +
scale_fill_manual(values = c("red", "orange", "yellow", "green", "blue",
"purple", "black"),
name = "Fire Size Classification",
breaks = c("A", "B", "C", "D", "E", "F", "G"),
labels = c("A: < 1/4 acre", "B: 1/4 to 10 acres", "C: 10 to 100 acres",
"D: 100 to 300 acres", "E: 300 to 1000 acres",
"F: 1000 to 5000 acres", "G: More than 5000 acres"))
`summarise()` ungrouping output (override with `.groups` argument)

3. Geo Spatial Visualisations
The dataset has a cause of fire column so I shall now create some causation plots.
Getting list of fire causes
fires_states %>%
distinct(stat_cause_descr) %>%
arrange(stat_cause_descr)
Wildfires caused by Arson
cause("Arson")
`summarise()` ungrouping output (override with `.groups` argument)

Arson does seem more prevalent in the SE states of Mississippi, Georgia, Alabama and also the western state of California.
Wildfires caused by Campfire
cause("Campfire")
`summarise()` ungrouping output (override with `.groups` argument)

Campfires are the most prevalent in the Western states of Oregon, California and Arizona.
Wildfires caused by Children
cause("Children")
`summarise()` ungrouping output (override with `.groups` argument)

Fires started by children are spread across the country, but the most affected states are California in the west, and Alabama, South Carolina and New Jersey in the east.
Wildfires caused by Debris Burning
cause("Debris Burning")
`summarise()` ungrouping output (override with `.groups` argument)

Fires by burning debris are mostly in the southern warmer states of Texas, Georgia and North Carolina.
Wildfires caused by Equipment Use
cause("Equipment Use")
`summarise()` ungrouping output (override with `.groups` argument)

Most of the fires caused by equipment seem to be in California.
Wildfires caused by Fireworks
cause("Fireworks")
`summarise()` ungrouping output (override with `.groups` argument)

Most of the fires caused by fireworks seem to be in the north of the country, primarily South Dakota, Montana and Washington state.
Wildfires caused by Lightning
cause("Lightning")
`summarise()` ungrouping output (override with `.groups` argument)

Apart from a hotspot of lightning strikes in Florida, the vast majority of fires caused by lightning are in the west of the country, with the 3 most affected states being California, Oregon and Arizona.
Wildfires caused by Miscellaneous
cause("Miscellaneous")
`summarise()` ungrouping output (override with `.groups` argument)

There seem to be quite a few miscellaneous classifications in California, Texas and New York.
Wildfires caused by Missing/Undefined
cause("Missing/Undefined")
`summarise()` ungrouping output (override with `.groups` argument)

The states with the most missing or undefined data are North and South Carolina, Oklahoma and California.
Wildfires caused by Railroad
cause("Railroad")
`summarise()` ungrouping output (override with `.groups` argument)

By far Florida has the most wildfires caused by railroads.
Wildfires caused by Smoking
cause("Smoking")
`summarise()` ungrouping output (override with `.groups` argument)

Fires caused by smoking seem to be spread around the country, but mainly on the east and west coasts.
Wildfires caused by Structure
cause("Structure")
`summarise()` ungrouping output (override with `.groups` argument)

South Dakota has the largest proportion of fires caused by structures.
Unsurprisingly, the southern states seem to have more occurrences of wildfire in general, no doubt due to the warmer climate at their latitudes. The states with the highest and 3rd highest number of fires are also the 2 largest states by size. However, the state with the 2nd highest number is Georgia, which is in the south of the country but is only an average-sized state. To get a better picture of what is going on, I’m going to normalise by state area and look at the number of fires per square mile.
The datasets package also includes the area of each state in square miles, in the state.area vector.
state.area
[1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876
[11] 6450 83557 56400 36291 56290 82264 40395 48523 33215 10577
[21] 8257 58216 84068 47716 69686 147138 77227 110540 9304 7836
[31] 121666 49576 52586 70665 41222 69919 96981 45333 1214 31055
[41] 77047 42244 267339 84916 9609 40815 68192 24181 56154 97914
length(state.area)
[1] 50
Annoyingly it only has the 50 states rather than the 52 entries in the fires data, so I will need to add DC and PR back in.
(Area figures obtained from Wikipedia)
DC = 68 sq miles, PR = 3515 sq miles
# To make my life easier I'm going to remove the state.abb and state.name vectors and make the tibble again, adding in the land area figures at the same time to make sure they are in the correct order.
rm(state.abb)
rm(state.name)
state.abb <- append(state.abb, c("DC", "PR"))
state.name <- append(state.name, c("District of Columbia", "Puerto Rico"))
# Append the areas as numbers; appending strings would turn the whole vector into character
state.area <- append(state.area, c(68, 3515))
state_list <- tibble(state = state.abb, region = tolower(state.name), area = state.area)
# Re-joining tibbles
fires_states <- fires_small %>%
left_join(state_list, by = "state")
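After a lookup join like this it is worth checking that every state code in the fires data actually found a match, otherwise the area comes through as NA and those rows silently drop out of the per-square-mile figures. A minimal sketch using `anti_join()` on made-up tables (`fires_toy` and `lookup_toy` are assumptions for illustration; the areas for CA and TX are taken from the state.area output above):

```r
library(dplyr)

# Toy fires table and a lookup that deliberately leaves out PR
fires_toy <- tibble(state = c("CA", "DC", "PR", "TX"))
lookup_toy <- tibble(
  state = c("CA", "TX", "DC"),
  area  = c(158693, 267339, 68)
)

# anti_join() keeps the rows of fires_toy that have NO match in lookup_toy
unmatched <- fires_toy %>%
  anti_join(lookup_toy, by = "state")

unmatched
```

An empty result would mean every code matched; here PR is returned because it is missing from the toy lookup.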
Normalising States area sizes
fires_states %>%
select(region, area) %>%
group_by(region, area) %>%
summarise(num_fires = n()) %>%
mutate(fires_sqmile = num_fires / area) %>%
arrange(desc(fires_sqmile))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
This table shows Puerto Rico has the highest proportion of fires compared to its size, followed by New Jersey in the NE of the country and finally by the States in the SE of the country.
fires_states %>%
select(region, area) %>%
group_by(region, area) %>%
summarise(num_fires = n()) %>%
mutate(fires_sqmile = num_fires / area) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = fires_sqmile)) +
geom_polygon() +
geom_path(color = "white") +
scale_fill_distiller(name = "Fire per Sq Mile", palette = "PuBuGn") +
theme_map() +
coord_map("mollweide") +
ggtitle(paste0("Total US Wildfires per Square Mile from 1992-2015")) +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)

Puerto Rico is not shown on this map, but for the other 51 entries we can see that the south-eastern states still have the highest proportion of wildfires. Interestingly, New Jersey also shows up as a hotspot in the NE of the country.
Do causes change over time?
Splitting causes into 2 groups for legibility.
The first group is for fires started directly by people.
fires_states %>%
select(stat_cause_descr, fire_year) %>%
group_by(fire_year, stat_cause_descr) %>%
filter(stat_cause_descr == "Arson" | stat_cause_descr == "Campfire" |
stat_cause_descr == "Children" | stat_cause_descr == "Equipment Use" |
stat_cause_descr == "Fireworks" | stat_cause_descr == "Smoking") %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
geom_line()
`summarise()` regrouping output by 'fire_year' (override with `.groups` argument)

The 2 large peaks in Arson in 1999 and 2006 are obvious. There was a large heatwave in 2006, but I’m not sure why this would result in an increase in arson, unless the dry ground provided extra fuel and helped fires spread that would not normally have become large-scale fires. This may also be the reason for the 2006 peak in Equipment Use. Arson does, however, look to have been decreasing since 2006.
And this one for naturally occurring fires and the remaining causes.
fires_states %>%
select(stat_cause_descr, fire_year) %>%
group_by(fire_year, stat_cause_descr) %>%
filter(stat_cause_descr == "Debris Burning" | stat_cause_descr == "Lightning" |
stat_cause_descr == "Miscellaneous" | stat_cause_descr ==
"Missing/Undefined" | stat_cause_descr == "Powerline" |
stat_cause_descr == "Railroad" | stat_cause_descr == "Structure") %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
geom_line()
`summarise()` regrouping output by 'fire_year' (override with `.groups` argument)

Similar peaks can be seen in Debris Burning, Miscellaneous and Lightning during the 2006 heatwave, which left the ground very dry. There are also peaks from 1997 to 2003 in Debris Burning, Miscellaneous and Lightning, matched by a trough in Missing/Undefined, so this is likely to be due to more accurate classification of fires and less use of the Missing/Undefined category.
Difference in causes between states
state_map_southern <- state_map %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana")
fires_states %>%
filter(fire_year == "1992" | fire_year == "1993" | fire_year == "1994" |
fire_year == "1995") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 1992-1995") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

ggsave("1992-1995.png")
Saving 7 x 7 in image
fires_states %>%
filter(fire_year == "1996" | fire_year == "1997" | fire_year == "1998" |
fire_year == "1999") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 1996-1999") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

ggsave("1996-1999.png")
Saving 7 x 7 in image
fires_states %>%
filter(fire_year == "2000" | fire_year == "2001" | fire_year == "2002" |
fire_year == "2003") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2000-2003") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

ggsave("2000-2003.png")
Saving 7 x 7 in image
fires_states %>%
filter(fire_year == "2004" | fire_year == "2005" | fire_year == "2006" |
fire_year == "2007") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2004-2007") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

ggsave("2004-2007.png")
Saving 7 x 7 in image
fires_states %>%
filter(fire_year == "2008" | fire_year == "2009" | fire_year == "2010" |
fire_year == "2011") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2008-2011") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

ggsave("2008-2011.png")
Saving 7 x 7 in image
fires_states %>%
filter(fire_year == "2012" | fire_year == "2013" | fire_year == "2014" |
fire_year == "2015") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2012-2015") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

ggsave("2012-2015.png")
Saving 7 x 7 in image
Looking at these trends, some interesting insights can be seen. In the combined data for all years Florida stands out as having railroad as its main cause of wildfire, but the plots above show that railroad is only the main cause up to the 4-year period ending in 2003; the main cause then changes to lightning until the end of the collection period in 2015. Similarly, arson is often the most common cause in the southern states until 2007, after which it no longer appears as the most common cause of wildfire. This downward trend was also noted earlier in the overall causation plots for all states.
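The six blocks above differ only in the years they filter on, so the counting step could be wrapped in a function and the long chains of `==` replaced with `%in%`. Below is a minimal sketch of that idea. The helper name `top_cause_by_period()` and the `toy` data are assumptions for illustration, and the plotting and map join are left out to keep it short.

```r
library(dplyr)

# Hypothetical helper: most common cause per region within a span of years
top_cause_by_period <- function(data, years, regions) {
  data %>%
    filter(fire_year %in% years, region %in% regions) %>%
    count(region, stat_cause_descr, name = "num_fire") %>%
    group_by(region) %>%
    slice_max(num_fire, n = 1, with_ties = FALSE) %>%
    ungroup()
}

# Toy rows standing in for fires_states (made-up values)
toy <- tibble(
  fire_year = c(1992, 1993, 1993, 1994, 1996, 1996, 1997),
  region = c("florida", "florida", "florida", "georgia",
             "florida", "florida", "georgia"),
  stat_cause_descr = c("Railroad", "Railroad", "Lightning", "Arson",
                       "Lightning", "Lightning", "Arson")
)

# One named entry per period; lapply() runs the helper once for each
periods <- list("1992-1995" = 1992:1995, "1996-1999" = 1996:1999)
results <- lapply(periods, top_cause_by_period,
                  data = toy, regions = c("florida", "georgia"))

results[["1992-1995"]]
results[["1996-1999"]]
```

Note that `slice_max(with_ties = FALSE)` keeps a single row per region, whereas `top_n(1)` in the blocks above keeps all tied rows, so the results can differ when two causes have the same count.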
Correlation between states and fire size
fires_states %>%
select(region, fire_size_class) %>%
group_by(region, fire_size_class) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = fire_size_class)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Fire Size Class", palette = "PuBuGn") +
ggtitle("Most common wildfire size per State 1992-2015") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
select(region, fire_size_class) %>%
filter(fire_size_class == "G") %>%
group_by(region) %>%
summarise(num_fire = n()) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = num_fire)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_distiller(name = "Number of Fires", palette = "PuBuGn") +
ggtitle("Number of large class G fires per State 1992-2015") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` ungrouping output (override with `.groups` argument)

From the plots we can see that the Western states have the most small fires and also the most large fires! Not entirely the most helpful plots…
Are fires more prevalent in certain months for individual states
fires_states %>%
filter(fire_year == "1992" | fire_year == "1993" | fire_year == "1994" |
fire_year == "1995") %>%
select(region, discovery_moy) %>%
group_by(region, discovery_moy) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
ggtitle("Month with most fires per State 1992-1995") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "1996" | fire_year == "1997" | fire_year == "1998" |
fire_year == "1999") %>%
select(region, discovery_moy) %>%
group_by(region, discovery_moy) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
ggtitle("Month with most fires per State 1996-1999") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2000" | fire_year == "2001" | fire_year == "2002" |
fire_year == "2003") %>%
select(region, discovery_moy) %>%
group_by(region, discovery_moy) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
ggtitle("Month with most fires per State 2000-2003") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2004" | fire_year == "2005" | fire_year == "2006" |
fire_year == "2007") %>%
select(region, discovery_moy) %>%
group_by(region, discovery_moy) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
ggtitle("Month with most fires per State 2004-2007") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2008" | fire_year == "2009" | fire_year == "2010" |
fire_year == "2011") %>%
select(region, discovery_moy) %>%
group_by(region, discovery_moy) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
ggtitle("Month with most fires per State 2008-2011") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2012" | fire_year == "2013" | fire_year == "2014" |
fire_year == "2015") %>%
select(region, discovery_moy) %>%
group_by(region, discovery_moy) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
ggtitle("Month with most fires per State 2012-2015") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2010" | fire_year == "2011" | fire_year == "2012" |
fire_year == "2013" | fire_year == "2014" | fire_year == "2015") %>%
select(region, discovery_moy) %>%
group_by(region, discovery_moy) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
ggtitle("Month with most fires per State 2010-2015") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

The above plots are quite interesting. The month with the most fires varies widely between states. In general the eastern half of the country has the most fires in the Spring (Feb-May) and the western half has the most fires later on, in Summer and Fall (Jun-Oct). There are however a few exceptions: in the 2004-2007 and 2008-2011 data Texas has the most fires in January. Florida also mostly conforms to the East/West split, with its worst month for fires being March or April up until 2007; the most common month then moves later, to June or July, for the rest of the reporting period until 2015. This may have to do with the main cause of fires in Florida changing from railroad to lightning at about the same time, as noted earlier when looking at causation. As July is the main month for tropical storms and lightning in Florida, this is a possible reason for the peak month moving later in the year. (2)
- https://www.weather.gov/mlb/fl_lightning_climo
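Another way to avoid repeating the same block for every period would be to label each fire with its 4-year period once and then group or facet by that label. A small base R sketch of the labelling step using `cut()` (the `years` vector is made up, and the period labels are simply the ones used in the plot titles above):

```r
# A few example years (made-up sample, not the fires data)
years <- c(1992, 1995, 1996, 2003, 2004, 2015)

# Breaks at 1991, 1995, ..., 2015 give six right-closed 4-year intervals
period <- cut(
  years,
  breaks = seq(1991, 2015, by = 4),
  labels = c("1992-1995", "1996-1999", "2000-2003",
             "2004-2007", "2008-2011", "2012-2015")
)

data.frame(years, period)
```

A column like this could then be added to the fires data with `mutate()` and used in `group_by()` alongside region and month, so that one pipeline covers all six periods.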
---
title: "R Notebook"
output: html_notebook
---

```{r}
library(tidyverse)
library(RSQLite)
library(dbplyr)
library(janitor)
library(lubridate)
library(datasets)
library(ggthemes)
library(gganimate)
library(modelr)
library(broom)
library(ggfortify)
library(infer)
```


# 1.  Data Cleaning


#### Creating a connection to the SQLite database and loading the fires dataset

```{r}
# Connecting

conn <- dbConnect(SQLite(), "raw_data/FPA_FOD_20170508.sqlite")
```

```{r}
# Pulling all the names of the tables in the database file

as.data.frame(dbListTables(conn))
```

```{r}
# Making fires dataframe

fires <- tbl(conn, "Fires") %>% collect()
```
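Collecting the whole table pulls every column into memory before any are dropped. A possible alternative, sketched here on a small in-memory SQLite database rather than the real file, is to `select()` on the lazy table so that only the wanted columns are read. The table contents and the `toy_conn` name are assumptions for illustration; only a few of the real column names are used.

```r
library(DBI)
library(dplyr)

# In-memory database with a toy "Fires" table (made-up rows, reduced columns)
toy_conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(toy_conn, "Fires", data.frame(
  FIRE_YEAR = c(1992L, 2006L),
  STATE = c("CA", "FL"),
  FIRE_SIZE = c(0.1, 120),
  UNUSED_COLUMN = c("x", "y")
))

# select() is translated to SQL, so UNUSED_COLUMN is never pulled into R
fires_toy <- tbl(toy_conn, "Fires") %>%
  select(FIRE_YEAR, STATE, FIRE_SIZE) %>%
  collect()

dbDisconnect(toy_conn)

fires_toy
```

On a table with many wide columns this can noticeably reduce the time and memory needed for `collect()`.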


#### Seeing what other useful information is in the database.  The majority are part of the database structure and are not readable in R.

```{r}
# EPSG worldwide geodetic parameter dataset system
spatial_ref <- tbl(conn, "spatial_ref_sys_all") %>% collect()

# National Wildfire Coordinating Group unit abbreviations 
NWGG <- tbl(conn, "NWCG_UnitIDActive_20170109") %>% collect()
```


```{r}
# Disconnect

dbDisconnect(conn)
```


### Selecting columns of interest

```{r}
fires_small <- fires %>%
  select(NWCG_REPORTING_AGENCY, SOURCE_REPORTING_UNIT_NAME, FIRE_NAME,
         FIRE_YEAR, DISCOVERY_DATE, DISCOVERY_DOY, DISCOVERY_TIME, CONT_DATE,
         CONT_DOY, CONT_TIME, STAT_CAUSE_CODE, STAT_CAUSE_DESCR, FIRE_SIZE, 
         FIRE_SIZE_CLASS, LATITUDE, LONGITUDE, OWNER_CODE, OWNER_DESCR, STATE, 
         COUNTY, FIPS_CODE, FIPS_NAME, Shape)

fires_small <- clean_names(fires_small)
```


### Changing some columns to be factors

```{r}
# Convert all of the categorical columns to factors in one step
fires_small <- fires_small %>%
  mutate(across(
    c(nwcg_reporting_agency, stat_cause_code, fire_size_class, owner_descr, state),
    as.factor
  ))
```


### Date is in Julian format, so overwriting it with a Gregorian date built from the year and day of year columns.  Also adding in a 'month of year' column for future use.

```{r}
fires_small <- fires_small %>%
  mutate(date_origin = as.Date(paste0(fire_year, "-01-01"))) %>%
  # discovery_doy is 1-based (1 = 1st January), so subtract 1 before adding it to the origin
  mutate(discovery_date = as.Date(discovery_doy - 1, origin = date_origin)) %>%
  mutate(discovery_moy = month(discovery_date, label = TRUE)) %>%
  select(-date_origin)
```
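One detail worth checking in a conversion like this is that day-of-year values start at 1, so adding the raw value to 1st January lands one day late. A minimal sketch of the conversion as a standalone function (`doy_to_date()` is a hypothetical helper name, not part of the notebook or of lubridate):

```r
# Hypothetical helper: convert a year and a 1-based day of year to a Date
doy_to_date <- function(year, doy) {
  as.Date(doy - 1, origin = paste0(year, "-01-01"))
}

doy_to_date(2015, 1)    # first day of a normal year
doy_to_date(2015, 185)  # 4th July in a normal year
doy_to_date(2016, 186)  # 4th July in a leap year
```

Without the `- 1`, day 1 would come out as 2nd January and day 365 of a normal year would spill over into the following year.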



# 2. Creating some initial visualisations


### Fires per year

```{r}
year_plot <- fires_small %>%
  group_by(fire_year) %>%
  summarise(num_fires =n())

year_plot %>%
  ggplot +
  aes(x = fire_year, y = num_fires) +
  geom_point() +
  ylim(0, 120000)
  # geom_col(fill = "dark blue", col ="white") +
  # geom_smooth(method = "lm", se = FALSE, colour = "red")

```
**There is a lot of variation between years.  Visually, a pattern seems to repeat roughly every 5 years, with 4 peaks visible within the reporting period.  Having looked at the historic weather for that date range, these peaks seem to coincide with recorded heatwaves in 2000, 2006 and 2011.(1)**

https://en.wikipedia.org/wiki/List_of_heat_waves



```{r}
model <- lm(formula = num_fires ~ fire_year, data = year_plot)
summary(model)
```
```{r}
tidy(model)
```

```{r}
clean_names(glance(model))
```

```{r}
autoplot(model)
```


```{r}
# Assign the result so the added pred and resid columns are kept
year_plot <- year_plot %>%
  add_predictions(model) %>%
  add_residuals(model)
year_plot
```

```{r}
year_plot %>%
  ggplot(aes(x = fire_year)) +
  geom_point(aes(y = num_fires)) +
  geom_abline(
    intercept = model$coefficients[1],
    slope = model$coefficients[2],
    col = "red"
  ) +
  ylim(0, 120000)
  
```






```{r}
bootstrap <- year_plot %>%
  specify(formula = num_fires ~ fire_year) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "slope")

# Compute the interval first so that it exists when the plot is drawn
slope_ci95 <- bootstrap %>%
  get_ci(level = 0.95, type = "percentile")
slope_ci95

bootstrap %>%
  visualise(bins = 30) +
  shade_ci(endpoints = slope_ci95)

```

```{r}
clean_names(tidy(model, conf.int = TRUE, conf.level = 0.95))
```








### Fires per day

```{r}
fires_small %>%
  group_by(discovery_date) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = discovery_date, y = num_fires) +
  geom_line(col = "dark blue") 

```

**This shows a typical time series plot with a cyclic variation due to warmer weather in the summer time.**


### Fires per month

```{r}
fires_small %>%
  mutate(year_month = make_date(fire_year, discovery_moy)) %>%
  group_by(year_month) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = year_month, y = num_fires) +
  geom_line(col = "dark blue") 
```

**Peaks are still shown to be occurring in the summer. The 2006 heatwave is especially visible.**


### Fires by day of year


```{r}
fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_doy, y = num_fires)) +
  geom_line(col = "dark blue")
```


**There are peaks around days 60-110 and a big peak around day 180.**

#### Checking the data to see where the peak occurs

```{r}
fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
```
**The 2 highest days of the year are 185 and 186, which are Independence Day (4th July) in a normal year and a leap year respectively.  So I imagine most of the extra fires (literally over double the normal amount) are caused by fireworks.**
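
The day-of-year claim can be checked directly. A minimal sketch using base R date formatting (`%j` gives the day of the year); 2015 and 2012 are simply picked here as an example normal year and leap year, and the variable names are made up for the illustration.

```r
# Day of year for 4th July in a normal year and in a leap year
july4_normal <- as.integer(format(as.Date("2015-07-04"), "%j"))
july4_leap   <- as.integer(format(as.Date("2012-07-04"), "%j"))

july4_normal  # 185
july4_leap    # 186
```

`lubridate::yday()` should give the same values if lubridate is already loaded.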



### Fires by month of year

```{r}
fires_small %>%
  group_by(discovery_moy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_moy, y = num_fires)) +
  geom_col(fill = "dark blue", col = "white")
```

**There are 2 definite peaks during the year.  March and April are probably due to the US "Spring Break", when schools and universities are closed, so families are likely to be on vacation during that period, possibly visiting National Parks.  July and August are also the Summer Break for schools, with families visiting Parks and hot weather both likely causes of fire outbreaks.**




### Fires by cause

```{r}
options(scipen = 999)

fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = reorder(stat_cause_descr, num_fires), y = num_fires) +
  geom_col(fill = "dark blue") + 
  coord_flip() 
```


### Fire avg size by cause

```{r}
fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(avg_size = mean(fire_size)) %>%
  ggplot +
  aes(x = reorder(stat_cause_descr, avg_size), y = avg_size) +
  geom_col(fill = "dark blue") + 
  coord_flip()
```


### Avg burn time by cause

```{r}
fires_small %>%
  summarise(num_na = sum(is.na(cont_date)))
```
*Literally half the data is missing for burn time (no containment date recorded), making it very difficult to do any meaningful analysis.*
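
To put a number on "half", the share of missing values can be computed as the mean of `is.na()`. A small sketch on a made-up vector, `cont_date_toy`, which is hypothetical and only stands in for the real `cont_date` column:

```r
# Toy containment dates: 2 of the 4 values are missing
cont_date_toy <- c("2005-02-02", NA, "2005-07-01", NA)

# is.na() gives TRUE/FALSE, so the mean is the proportion missing
prop_missing <- mean(is.na(cont_date_toy))
prop_missing  # 0.5
```

On the real data the same idea would be `mean(is.na(cont_date))` inside `summarise()`.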



### Fires by size


```{r}
fires_small %>%
  group_by(fire_size_class) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_size_class, y = num_fires, fill = fire_size_class) +
  geom_col() +
  scale_fill_manual(values = c("red", "orange", "yellow", "green", "blue", 
                               "purple", "black"),
                    name = "Fire Size Classification",
                    breaks = c("A", "B", "C", "D", "E", "F", "G"),
                    labels = c("A: < 1/4 acre", "B: 1/4 to 10 acres", "C: 10 to 100 acres",
                               "D: 100 to 300 acres", "E: 300 to 1000 acres",
                               "F: 1000 to 5000 acres", "G: More than 5000 acres"))

```



# Geo Spatial wrangling 


### To make it easier to visually compare the frequency of wildfires between states I want to display it in a map format.  As I'm already using ggplot2, I'm going to use it for the maps too, with `geom_polygon()` and `coord_map()` along with `theme_map()` from ggthemes.


#### I'm not entirely sure what geo-spatial information is held within the sqlite database file; I've made a few attempts to retrieve it but have been unsuccessful.  Therefore I'm going to utilise ggplot2's `map_data()` function (which relies on the `maps` package) to get coordinates for the state boundaries, together with the `datasets` package, which includes various bits of information on the US States.


```{r}
# State boundary co-ordinates via ggplot2's map_data() (needs the 'maps' package)

state_map <- map_data("state")
state_map
```


#### Annoyingly it doesn't have the abbreviation of the State, only the full name, so I need to add that in.  Luckily the `datasets` package has vectors of State names and abbreviations, so I shall make a tibble with them both in.


```{r}
state.abb
```

```{r}
state.name
```

```{r}
state_list <- tibble(state = state.abb, state_name = state.name)
state_list
```


#### The `state_map` dataframe is in lower case and has the column name 'region'.  I shall change the `state_list` tibble to be the same format so they can be joined together.


```{r}
state_list <- tibble(state = state.abb, region = tolower(state.name))
```


#### Joining `state_list` to the `fires_small` dataset

```{r}
fires_states <- fires_small %>%
  left_join(state_list, by = "state")

fires_states
```


#### Checking the join has worked and there are no missing values.

```{r}
fires_states %>%
  filter(is.na(region))
```


#### There do seem to be 22,147 NAs in the 'region' column we just made.  Scrolling through, 2 State codes, 'PR' and 'DC', are missing from the `state_list` tibble.

#### After some quick research it seems that there are only 50 States in the US.  Washington DC is technically not counted as a State but as a Federal District, as it is the seat of government, which is why it wasn't included in the `state_list` tibble originally.  PR is Puerto Rico, which is also not a State but the largest US territory.
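
Rather than scrolling, the unmatched codes can be listed directly with `setdiff()`. A minimal sketch, assuming a small made-up vector `codes_in_data` stands in for the distinct values of the `state` column in `fires_small`:

```r
# Toy set of state codes as they might appear in the fire data
codes_in_data <- c("CA", "DC", "GA", "PR", "TX")

# Codes in the data that have no match in the built-in abbreviations
missing_codes <- setdiff(codes_in_data, datasets::state.abb)
missing_codes  # "DC" "PR"
```

Using `datasets::state.abb` explicitly avoids picking up a modified copy of `state.abb` from the global environment.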


#### I shall add DC and PR into the state_list and re-join it.

```{r}
# Adding 2 new states

state.abb <- append(state.abb, c("DC", "PR"))
state.name <- append(state.name, c("District of Columbia", "Puerto Rico"))

state_list <- tibble(state = state.abb, region = tolower(state.name))
```


```{r}
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")
```


```{r}
# Checking the join has worked properly and there are no NAs

fires_states %>%
  filter(is.na(region))
```

```{r}
# Code below brings up a "vector memory exhausted (limit reached?)" error

# fires_joined <- fires_states %>%
#  right_join(state_map, by = "region")
```


#### The data set and geo information are too big to join, so I'm going to summarise first to get the number of fires per region.

```{r}
fires_joined <- fires_states %>% 
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n()) %>%
    right_join(state_map, by = "region")
```
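
Why summarising first helps can be seen on a tiny example. This is a sketch with made-up data (`fires_toy` and `map_toy` are hypothetical stand-ins for `fires_states` and `state_map`): joining raw fire rows to boundary points gives one row for every fire and boundary point pair, whereas joining the per-region counts keeps one row per boundary point.

```r
library(dplyr)

# 4 fires in one region, 2 in another (toy data)
fires_toy <- tibble(region = c(rep("texas", 4), rep("ohio", 2)))

# 3 boundary points per region (toy coordinates)
map_toy <- tibble(
  region = rep(c("texas", "ohio"), each = 3),
  long   = c(-100, -99, -98, -84, -83, -82),
  lat    = c(31, 32, 31, 40, 41, 40)
)

# Raw join: every fire is paired with every boundary point of its region.
# Recent dplyr versions warn about this many-to-many match, hence suppressWarnings().
raw_join <- suppressWarnings(right_join(fires_toy, map_toy, by = "region"))
nrow(raw_join)  # 4 * 3 + 2 * 3 = 18

# Summarise first: one count per region, then one row per boundary point
summarised_join <- fires_toy %>%
  count(region, name = "num_fires") %>%
  right_join(map_toy, by = "region")
nrow(summarised_join)  # 6
```

With well over a million fires and thousands of boundary points per map, the first form grows far too large for memory, while the second stays the size of the map data.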

**Result!!  Now doing first geo spatial visualisation**


### Total Wildfires per state from 1992-2015

```{r}
fires_joined %>% 
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle("Total US Wildfires from 1992-2015") + 
    theme(plot.title = element_text(hjust = 0.5))
```


# 3. Geo Spatial Visualisations

### The dataset has a cause of fire column so I shall now create some causation plots.


#### Getting list of fire causes

```{r}
fires_states %>%
  distinct(stat_cause_descr) %>%
  arrange(stat_cause_descr)
```


### Total fire by cause in tabular form

```{r}
fires_states %>%
  select(stat_cause_descr) %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n ()) %>%
  arrange(desc(num_fires))
  
```

### Number of fires by state in tabular form

```{r}
fires_states %>%
  select(region) %>%
  group_by(region) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
```


#### As the cause needs to be filtered before the map join, I'm either going to have to repeat a whole load of the same code in every single plot, or write a function that will do it for me, saving a lot of typing!

```{r}
# Function for plotting cause of fire

cause <- function(cause) {
  fires_states %>%
    filter(stat_cause_descr == cause) %>%
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n ()) %>%
    right_join(state_map, by = "region") %>%
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle(paste0("Total US Wildfires caused by ", cause, " from 1992-2015")) + 
    theme(plot.title = element_text(hjust = 0.5))
}
```



### Wildfires caused by Arson

```{r}
cause("Arson")
```

**Arson does seem more prevalent in the SE states of Mississippi, Georgia, Alabama and also the western state of California.**


### Wildfires caused by Campfire

```{r}
cause("Campfire")
```

**Campfires are the most prevalent in the Western states of Oregon, California and Arizona.**


### Wildfires caused by Children

```{r}
cause("Children")
```

**Fires by children are spread about the country, but the most prevalent states are California in the West, and Alabama, South Carolina and New Jersey in the East.**


### Wildfires caused by Debris Burning

```{r}
cause("Debris Burning")
```

**Fires by burning debris are mostly in the southern warmer states of Texas, Georgia and North Carolina.**

### Wildfires caused by Equipment Use

```{r}
cause("Equipment Use")
```

**Most of the fires caused by equipment seem to be in California.**


### Wildfires caused by Fireworks

```{r}
cause("Fireworks")
```

**Most of the fires caused by fireworks seem to be in the north of the country, primarily South Dakota, Montana and Washington state.**


### Wildfires caused by Lightning

```{r}
cause("Lightning")
```

**Apart from a hotspot of lightning strikes in Florida, the vast majority of fires caused by lightning are in the West of the country, with the 3 most affected states being California, Oregon and Arizona.**

### Wildfires caused by Miscellaneous

```{r}
cause("Miscellaneous")
```

**There seems to be quite a few miscellaneous classifications in California, Texas and New York.**


### Wildfires caused by Missing/Undefined

```{r}
cause("Missing/Undefined")
```

**The states with the most missing or undefined data are North and South Carolina, Oklahoma and California.**


### Wildfires caused by Powerline

```{r}
cause("Powerline")
```

**Texas has the largest number of wildfires caused by powerlines.  This is likely due to the warm climate and the large proportion of the state that is dry grassland used for agriculture. (1)**

(1) https://uk.reuters.com/article/us-wildfires-texas/trees-and-power-lines-caused-major-texas-fire-idUSTRE78J76A20110920


### Wildfires caused by Railroad

```{r}
cause("Railroad")
```

**By far Florida has the most wildfires caused by railroads.**


### Wildfires caused by Smoking

```{r}
cause("Smoking")
```

**Fires caused by smoking seem to be spread around the country, but mainly on the east and west coasts.**


### Wildfires caused by Structure

```{r}
cause("Structure")
```

**South Dakota has the largest proportion of fires caused by structures.**



#### Unsurprisingly the southern states seem to have more occurrences of wildfire in general, no doubt due to the warmer climate at their latitudes.  Also the States with the 1st and 3rd highest number of fires are 2 of the largest States by size.  However the 2nd highest State is Georgia, which, although it is in the South of the country, is only an average sized State.  Therefore, to get a better picture of what is going on, I'm going to look at the number of fires per square mile, normalising by State area.

#### The `datasets` package also has the area in square miles of each state in the `state.area` vector.

```{r}
state.area
```

```{r}
length(state.area)
```

#### Annoyingly it also only has 50 entries, not 52, so I will need to add DC and PR back in.

(Area figures obtained from Wikipedia)

DC = 68 square miles

PR = 3,515 square miles


```{r}
# To make my life easier I'm going to rebuild the tibble from the original
# `datasets` vectors, adding DC and PR and their land areas (square miles)
# at the same time so that everything stays in the correct order.

state_list <- tibble(
  state = c(datasets::state.abb, "DC", "PR"),
  region = tolower(c(datasets::state.name, "District of Columbia", "Puerto Rico")),
  area = c(datasets::state.area, 68, 3515)
)
```

```{r}
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")
```


### Normalising States area sizes

```{r}
fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  arrange(desc(fires_sqmile))
```

#### This table shows Puerto Rico has the highest proportion of fires compared to its size, followed by New Jersey in the NE of the country and finally by the States in the SE of the country.
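
The effect of the normalisation is easiest to see on a tiny example. A sketch with made-up numbers (`toy_states` is hypothetical and not taken from the dataset): the larger state has more fires in total but fewer per square mile.

```r
library(dplyr)

# Toy figures: a large state with many fires and a small state with fewer
toy_states <- tibble(
  region    = c("big state", "small state"),
  num_fires = c(1000, 200),
  area      = c(100000, 5000)  # square miles
)

toy_rates <- toy_states %>%
  mutate(fires_sqmile = num_fires / area) %>%
  arrange(desc(fires_sqmile))

toy_rates  # small state: 0.04 per sq mile, big state: 0.01 per sq mile
```

This is the same pattern as Puerto Rico and New Jersey above: a modest fire count over a small area can outrank a large count over a very large area.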


```{r}
fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = fires_sqmile)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  scale_fill_distiller(name = "Fire per Sq Mile", palette = "PuBuGn") +
  theme_map() + 
  coord_map("mollweide") + 
  ggtitle(paste0("Total US Wildfires per Square Mile from 1992-2015")) + 
  theme(plot.title = element_text(hjust = 0.5))
```

**Puerto Rico is not shown on this map (nor are Alaska and Hawaii, as the `state` map data only covers the contiguous States and DC), but visually we can see that the south eastern States still have the highest proportion of wildfires.  Interestingly New Jersey also shows up as a hotspot in the NE of the country.**


### Do causes change over time?


#### Splitting causes into 2 groups for legibility.

#### The first group is for fires directly caused by people.

```{r}
fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr == "Arson" | stat_cause_descr == "Campfire" |
           stat_cause_descr == "Children" | stat_cause_descr == "Equipment Use" |
           stat_cause_descr == "Fireworks" | stat_cause_descr == "Smoking") %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
```

**The 2 large peaks in Arson in 1999 and 2006 are obvious. There was a large heatwave in 2006, but I'm not sure why this would result in an increase in arson, unless the dry ground simply provided extra fuel, helping fires spread that would not normally have become large-scale fires.  This may also be the reason for the 2006 peak in Equipment Use.  Arson does however look to be decreasing since 2006.**


#### And this one for naturally occurring fires and the remaining causes.

```{r}
fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr == "Debris Burning" | stat_cause_descr == "Lightning" |
           stat_cause_descr == "Miscellaneous" | stat_cause_descr == 
           "Missing/Undefined" | stat_cause_descr == "Powerline" | 
           stat_cause_descr == "Railroad" | stat_cause_descr == "Structure") %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
```

**Similar peaks can be seen in debris burning, miscellaneous and lightning in the heatwave of 2006 that left the ground very dry.  There are peaks from 1997 to 2003 in debris burning, miscellaneous and lightning, but also a trough in missing/undefined, so this is likely to be due to more accurate classification of fires, with the missing/undefined category being used less.**



### Difference in causes between states


```{r}
state_map_southern <- state_map %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" | 
           region == "arkansas" | region == "louisiana")
```


```{r}
fires_states %>%
  filter(fire_year == "1992" | fire_year == "1993" | fire_year == "1994" |
           fire_year == "1995") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1992-1995") + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
ggsave("1992-1995.png")
```


```{r}
fires_states %>%
  filter(fire_year == "1996" | fire_year == "1997" | fire_year == "1998" |
           fire_year == "1999") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1996-1999") + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
ggsave("1996-1999.png")
```



```{r}
fires_states %>%
  filter(fire_year == "2000" | fire_year == "2001" | fire_year == "2002" |
           fire_year == "2003") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2000-2003") + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
ggsave("2000-2003.png")
```



```{r}
fires_states %>%
  filter(fire_year == "2004" | fire_year == "2005" | fire_year == "2006" |
           fire_year == "2007") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2004-2007") + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
ggsave("2004-2007.png")
```




```{r}
fires_states %>%
  filter(fire_year == "2008" | fire_year == "2009" | fire_year == "2010" |
           fire_year == "2011") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2008-2011") + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
ggsave("2008-2011.png")
```



```{r}
fires_states %>%
  filter(fire_year == "2012" | fire_year == "2013" | fire_year == "2014" |
           fire_year == "2015") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2012-2015") + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
ggsave("2012-2015.png")
```

**Looking at these trends some interesting insights can be seen.  For the combined years data Florida stands out as having railroad as its main cause of wildfire, but from the above plots it can be seen that railroad fires are only the main cause up to the 4 year period ending in 2003, after which the main cause changes to lightning until the end of the collection period in 2015.  Similarly arson seems reasonably common as the main cause in the southern states until 2007, after which it no longer appears as the most common cause of wildfire.  This downward trend was also noted earlier in the overall causation plots for all states.**
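
The six near-identical chunks above could be driven from a single table of the top cause per region and period. A minimal sketch of that idea on made-up rows (`fires_toy` is a hypothetical stand-in for `fires_states`, and the `period` column and its labels are assumptions for the example): `cut()` bins the years into 4 year periods and `slice_max()` keeps the most frequent cause in each group.

```r
library(dplyr)

# Toy stand-in for fires_states (made-up rows, not the real data)
fires_toy <- tibble(
  region = c("florida", "florida", "florida", "florida", "florida",
             "georgia", "georgia", "georgia"),
  fire_year = c(1992, 1993, 1995, 2004, 2005, 1994, 1994, 2006),
  stat_cause_descr = c("Railroad", "Railroad", "Lightning", "Lightning",
                       "Lightning", "Arson", "Arson", "Debris Burning")
)

period_labels <- c("1992-1995", "1996-1999", "2000-2003",
                   "2004-2007", "2008-2011", "2012-2015")

top_causes <- fires_toy %>%
  # Breaks at 1991, 1995, ..., 2015 give right-closed bins such as (1991, 1995]
  mutate(period = cut(fire_year, breaks = seq(1991, 2015, by = 4),
                      labels = period_labels)) %>%
  count(period, region, stat_cause_descr, name = "num_fire") %>%
  group_by(period, region) %>%
  # Keep the single most frequent cause per period and region
  slice_max(num_fire, n = 1, with_ties = FALSE) %>%
  ungroup()

top_causes
```

A table like this could then be joined to the map data and drawn with `facet_wrap(~ period)`, or filtered one period at a time inside a small plotting function, instead of copying the chunk for each period.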



### Correlation between states and fire size

```{r}
fires_states %>%
  select(region, fire_size_class) %>%
  group_by(region, fire_size_class) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = fire_size_class)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Fire Size Class", palette = "PuBuGn") +
  ggtitle("Most common wildfire size per State 1992-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  select(region, fire_size_class) %>%
  filter(fire_size_class == "G") %>%
  group_by(region) %>%
  summarise(num_fire = n()) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = num_fire)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_distiller(name = "Number of Fires", palette = "PuBuGn") +
  ggtitle("Number of large class G fires per State 1992-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```

**From the plots we can see that the Western states have the most small fires and also the most large fires!  Not entirely the most helpful plots...**



### Are fires more prevalent in certain months for individual states?

```{r}
fires_states %>%
  filter(fire_year == "1992" | fire_year == "1993" | fire_year == "1994" |
           fire_year == "1995") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 1992-1995") +
  theme(plot.title = element_text(hjust = 0.5))
```



```{r}
fires_states %>%
  filter(fire_year == "1996" | fire_year == "1997" | fire_year == "1998" |
           fire_year == "1999") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 1996-1999") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year == "2000" | fire_year == "2001" | fire_year == "2002" |
           fire_year == "2003") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2000-2003") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year == "2004" | fire_year == "2005" | fire_year == "2006" |
           fire_year == "2007") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2004-2007") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year == "2008" | fire_year == "2009" | fire_year == "2010" |
           fire_year == "2011") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2008-2011") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year == "2012" | fire_year == "2013" | fire_year == "2014" |
           fire_year == "2015") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2012-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year == "2010" | fire_year == "2011" | fire_year == "2012" |
           fire_year == "2013" | fire_year == "2014" | fire_year == "2015") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2010-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```


**The above plots are quite interesting.  The month of the year with the most fires seems to vary widely between states.  Mainly, the eastern half of the country has the most fires in the Spring (Feb-May) and the western part of the country has the most fires later on, in Summer and Fall (Jun-Oct).  There are however a few exceptions: in the 2004-2007 and 2008-2011 data Texas has the most fires in January.  Florida also mostly conformed to the East/West split, with its worst months for fires being March or April up until 2007; the most common month then moves later, into June and July, for the rest of the reporting period until 2015.  This may have to do with the main cause of fires in Florida changing from railroad to lightning at about the same time, as we noted earlier on when looking at causation.  As July is the main month for tropical storms and lightning in Florida, this is a possible cause for the highest month becoming later in the year than before. (2)**


(2) https://www.weather.gov/mlb/fl_lightning_climo


